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Abstract 

The role of variable regions of HIV-1 gp120 in immune escape of HIV has been investigated. However, there is scant 
information on how conserved gp120 regions contribute to virus escaping. Here we have studied how molecular sequence 
characteristics of conserved C3, C4 and V3 regions of clade C HIV-1 gp120 that are involved in HIV entry and are target of 
the immune response, are modulated during the disease course. We found an increase of "shifting" putative N-glycosylation 
sites (PNGSs) in the a2 helix (in C3) and in C4 and an increase of sites under positive selection pressure in the a2 helix during 
the chronic stage of disease. These sites are close to CD4 and to co-receptor binding sites. We also found a negative 
correlation between electric charges of C3 and V4 during the late stage of disease counteracted by a positive correlation of 
electric charges of a2 helix and V5 during the same stage. These data allow us to hypothesize possible mechanisms of virus 
escape involving constant and variable regions of gp120. In particular, new mutations, including new PNGSs occurring near 
the CD4 and CCR5 binding sites could potentially affect receptor binding affinity and shield the virus from the immune 
response. 
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course of the disease, generating HIV variants resistant to 
neutralizing antibodies. Similar mechanisms of escape have been 
observed in different HIV-1 subtypes of the M group [10,15-18], 
although subtype-specific differences may be present [9,19]. 

In HIV-1 subtype C, the most widespread HIV clade worldwide 
and the prevalent one in Southern Africa [20,21], we and others 
have shown that during the chronic stage of the disease the VI and 
V4 variable regions of gpl20 undergo a significant increase of 
their aminoacid sequence length and new PNGSs are generated 
[9,20], while the V5 undergoes changes in its net electrical charge 
during the course of the disease. These changes may play a role in 
glycan packing and immune evasion [9,14,22]. 



Introduction 

Human immunodeficiency virus type 1 (HIV-1) gpl20 envelope 
protein is characterized by a remarkable genetic variability [1-3], 
due to extensive replication of the virus and to a high mutation 
rate of the reverse transcriptase [4]. This variability is largely 
responsible for the ability of the virus to escape the host immune 
response, in particular neutralizing activity of antibodies [5-8]. 
The role of the variable regions of gp 1 20 (V 1 to V5) in promoting 
virus escape has been reported by several groups, including ours 
[9-15]. These studies have shown that both amino acid sequence 
length and number of putative N-glycosylation sites (PNGSs) in the 
V 1 , V2 and V4 variable regions of the protein change during the 
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Few studies have been conducted to date that have investigated 
the modulation of molecular sequence characteristics of the 
constant regions of gpl20 during the course of the disease, even 
though these regions are responsible for HIV entry into the cell, 
and are important targets of the immune response, including 
neutralizing antibodies. The V3 domain, known to be variable in 
clade B and D viruses, is relatively conserved in HIV-1 subtypes A, 
C, G, E and H [Los Alamos website. Available:http:/ /www. 
hiv.lanl.gov/content/sequence/HIV/COMPENDIUM/1999/7/ 
V3LoopEvolution.pdf. Accessed 2014 March 31, and 9, 23, 24]. It 
plays a critical role in virus entry, since it binds both CGR5 and 
CXCR4 receptors and is a target of neutralizing antibodies 
[9,23,25,26]. The C3-V4 region has been reported to be a major 
target of the early autologous neutralizing response in HIV-1 
subtype C infection and the N-terminal portion of C3, which 
contains the oc2 helix region, may play a role in virus escape 
processes [6,24,27,28]. It has also been reported that some 
residues in C3 bind the CD4 receptor, in association with amino 
acids in the C4 region [29-32]. 

In the present study, we have therefore investigated how amino 
acid characteristics of the constant C3, C4 and V3 regions of HIV- 
1 subtype C (i.e. sequence length, number of PNGSs, number of 
sites under positive selection pressure and electrical charges) are 
modulated during the course of the disease. We used a novel 
integrative bioinformatics sequence analysis which allowed us to 
hypothesize a model to explain how conformational changes in 
these regions may regulate virus escape from the immune 
response. 

Materials and Methods 

Ethics Statement 

The study was approved by the Committee for Research on 
Human Subjects (Medical) at the University of Witwatersrand and 
by the Ministry of Health and Social Welfare Scientific Ethical 
Committee of Swaziland. All participants provided written 
informed consent to participate in the study. 

Sample collection, determination of disease stage and 
description of the cohorts 

A total of 72 plasma samples were collected from in the time 
period 2005 to 2007 from HIV-positive individuals, living in South 
Africa and Swaziland, all naive for antiretroviral therapy (ART); 
24 samples were obtained from the 2006 Swaziland HIV National 
Serosurvey [33,34], 24 samples from the Chris Hani Baragwanath 
Hospital (CHBH) in Soweto, Johannesburg, South-Africa, in the 
framework of the activities included in the AVIP project 
(European Commission website. Available: http:/ /ec.europa.eu/ 
research/health/ infectious-diseases/ poverty-diseases/ projects/84_ 
en.htm. Accessed 2014 March 31) and 24 samples from the HIV/ 
AIDS National Referral Laboratory (NRL) at the Government 
Hospital in Mbabane, Swaziland, in the framework of activities 
linked to the Italian AIDS Programme. Ethical clearance for these 
studies was obtained from local Ethical Committees. 

The 72 patients were classified for disease stage into 3 groups, 
based on Avidity Index (AI) assay data [34,35] and CD4+ T cell 
number, as previously described [9]. Briefly, by AI assay we 
identified those HIV infections that occurred within 6 months 
from blood drawing, independently of their CD4+ T cell number. 
A number of 24 patients were classified as recently infected (ES = 
Early disease Stage). All these patients were from Swaziland, 
collected in the frame of the 2006 Swaziland National Serosurvey 
[33,34]. Patients that by AI were identified having an established 
infection were further classified into patients at chronic or late 



disease stage, according to their CD4+ T cell number. Specifically, 
24 patients with a CD4+ T cell count >200/u,l were classified in 
the chronic disease stage group (CS = Chronic disease Stage), 
whereas 24 patients with a CD4+ T cell count £200/|xl were 
included in the late disease stage group (LS = Late disease Stage). 
CD4+ T cell number median values were 544,5/ ul (values ranging 
from 418 to 81 1) and 122/jxl (values ranging from 7 to 199) for CS 
and LS groups, respectively. The 24 patients included in the CS 
group were from South Africa. Samples from 6 out of these 24 
patients were collected in 2005. The remaining 18 samples were 
collected in 2006. The 24 patients included in the LS group were 
from Swaziland. Samples from 1 3 out of these 24 patients were 
collected in 2006. The remaining 1 1 samples were collected in 
2007. 

CD4+ T cell count 

CD4+ T cell count was performed by means of MultiTEST and 
TruCOUNT tubes (Beckton Dickinson) according to the manu- 
facturer's instruction. 

Viral RNA extraction and sequence amplification 

Viral RNA was extracted from 0.5 ml of plasma using Qiamp 
viral RNA mini Kit (Qiagen), after treatment with Heparinase 
(Sigma). 

The VI to V5 coding region in the env gene was amplified by 
RT-PCR using SuperScriptTM One-Step RT-PCR System 
(Invitrogen) and specifically designed outer primers for clade C. 
The following primers were used: AC-Env Outer For = 
5 ' C AGATGC ATGAGGATATAATC A3 ' ; ED 12 m Outer Rev 
= 5'AGTGCTTCCTTGCTGCTCCCAA3'. The RT-PCR mix 
was composed of 0.5 to 1 (Xg of RNA and an RT/Taq buffer 
mixture containing dATP, dCTP, dTTP, dGTP 0.4 mil (Roche), 
MgS0 4 2.4 mM (Invitrogen), 1 u.1 RT/Taq-platinum (Invitrogen), 
RNase inhibitor 40 U/u.1 (Invitrogen), and primers 10 uM (MWB- 
Biotech); RT-PCR reaction was carried out using a thermal cycler 
(Eppendori) and the following program: 45°C for 30' for retro- 
transcription and 94°C for 2' for the RT denaturation; the 
resulting cDNA was amplified as follows: 94°C for 15", 50°C for 
30" and 68°C for 1 '30"for a total of 40 cycles and a final extension 
step at 68°C for 7'. 

The resulting PCR product was again amplified in a nested 
PCR, using specifically designed primers in order to obtain 
nucleotide sequences corresponding to the V3-C3 and the V4-C4- 
V5 Env regions, respectively. Primers specific for each region were 
the following: for the V3-C3 region: V3C3ForA = 
5'CTGTTAAATGGTAGCCTAGC3' and V3C3RevA = 
5 ' GC AATAGAAAAATTCTCC 3 ' . If amplification failed, a 
nested PCR was repeated using another couple of inner primers, 
i.e. V3C3ForB = 5 ' C AC AGTAC AATGTAC AC ATG3 ' and 
V3C3RevB = 5'RCAATAGAAAAATTCTCCTC3'; for the 
V4-C4-V5 region: V4-AFor5' = 5 ' GTRG AGG A- 
GAATTTTTCTATTG3 ' , V5-ARev5' = TATAATT- 
CACTTCTCCARTTGTC. If amplification failed, a nested 
PCR was repeated using another couple of inner primers: V4- 

V5-BRev5 ' TATTTATATAATTC ACTTCTCC AATTGTC 3 ' . 
Each nested PCR reaction was conducted as follows: 1-5 u.1 
aliquot of the amplified product was added to a reaction mixture 
containing dNTPS 200 uJVI (Roche), MgCl 2 2.5 mM (Invitrogen), 
the couple of 20 U.M primers corresponding to the region that had 
to be amplified (MWB-Biotech) and AmpliTaq Gold DNA 
polymerase 2.5 U/u.1 (Applied Biosystems). Amplifications were 
carried out specifically for each region, as following: V3-C3: 96°C 
for 7' (1st cycle), then 15"at 94°C, 30" at 50°C, 30" at 72°C, for 40 
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cycles; V4-C4-V5 region: 96°C for V (1st cycle), then 15" at 
94°C, 30" at 44°C, 30" at 72°C, for a total of 40 cycles and a final 
extension step of 7' at 68°C. 

DNA purification and sequencing 

Amplified DNA of each region was purified using Qiaquick 
PGR purification kit (Qiagen), according to the manufacturer's 
protocol. The DNA samples were then quantified and checked for 
purity by measuring absorbance at 260 nm and 280 nm. The V3— 
C3 and V4— C4-V5 amplified regions were then sequenced using 
BigDye (Applied Biosystem), with the same primers as the nested 
PGR. Sequencing was performed on uncloned PCR products to 
identify the prevalent viral quasispecies. 

The electropherogram analysis was edited using Chromas Pro 
(Technelysium website. Available: www.technelysium.com.au/ 
ChromasPro.html. Accessed 2014 March 31). All sequences were 
aligned using ClustalW by Bioedit [36] and corrected for the 
multiple alignment by manual editing. The nucleotide sequences 
were translated into amino acid sequences using GeneRunner 
(www.generunner.net) and further codon-aligned using BioEdit 
[36]. 

Phylogenetic analysis 

Phylogenetic analysis was based on the V3-V5-coding region 
from 72 HIV-1 variants isolated from the 72 patients (one variant/ 
patient) included in the cohorts. It was carried out using the PAUP 
software (version 4.0) (http:/ /paup. csit.fsu.edu/downl.html) with 
the K8 1 model of substitution and using both Neighbor-Joining 
(NJ) and Maximum Likelihood (ML) treebuilding methods. The 
evolutionary model was chosen as the best fitting nucleotide 
substitution model in accordance with the results of the 
Hierarchical Likelihood Ratio Test (HLRT) implemented in the 
MODELTEST software (version 3.6) [37]. 

The parameters for the nucleotide substitution model were 
estimated by the ML method using a NJ tree Jukes-Cantor 
distance) as the base tree [38]. 

The statistical robustness and reliability of the branching order 
within phylogenetic trees were confirmed by either a bootstrap 
analysis using 1000 replicates, for the NJ tree, or the zero branch 
length tests for the ML tree. All calculations were performed with 
PAUP software (version 4.0). 

Analysis of predicted co-receptor usage and 
identification of PNGSs 

The analysis of predicted co-receptor usage was performed by 
using Geno2pheno (Genafor website. Available: http:// 
coreceptor.geno2pheno.org/index.php. Accessed 2014 March 
31) upgraded version, June 17, 2011. 

Identification of PNGSs was performed by using N-GLYCO- 
SITE (Los Alamos website. Available: http:/ /www.hiv.lanl.gov/ 
content/sequence/GLYCOSITE/glycosite.html. Accessed 2014 
March 31). 

Analysis of electric charges 

To determine the electric charge of each region, the algebraic 
sum of all charged residues, both positive and negative, was 
considered as total electric charge (Q,„ t ). The sum of all positive 
charged residues was considered as total positive charge (Qp OS ) and 
the sum of all negative charged residues as total negative charge 

(Qneg). 

An original Matlab (the MathWorks Inc. USA) program was 
written to yield the distribution of Q, ot , Qp os and Q^eg directly 
from the *.fas files with the aligned sequences. Charge distribu- 



tions were analyzed with the Wilcoxon's rank sum test, after a 
preliminary checking of the non-Gaussian nature of the distribu- 
tions using the Kolmogorov-Smirnov test, as previously described 
[39]. Finally, the charge correlation coefficient was calculated with 
Madab. 

Codon specific dN/dS ratio 

The CODEML program implemented in the PAML 3.14 
software package (UCL website. Available: http://abacus.gene. 
ucl.ac.uk/ software/paml.html. Accessed 2014 March 31) was used 
to investigate the adaptive evolution of the V3, C3 and C4 regions 
of HIV- 1 env gene [40] . For each region, sequence alignment was 
performed prior to testing which amino acids were under positive 
selection. Six models of codon substitution, MO (one-ratio), Mia 
(nearly neutral), M2a (positive selection), M3 (discrete), M7 (beta), 
and M8 (beta and omega) were applied in this analysis [41]. Since 
these models are nested, we used codon-substitution models to fit 
the model to the data using the likelihood ratio test (LRT) [42] . 

The "Bayes empirical Bayes" (BEB) approach implemented in 
M2a and M8 was used to determine the positively selected sites by 
calculating the posterior probabilities (P) of the different classes for 
each site [43]. 

Finally, the Shannon's entropy analysis was performed by using 
the Entropy software (Los Alamos website. Available: http:/ /www. 
hiv.lanl.gov/ content/ sequence/ENTROPY/ entropy_one.html. 
Accessed 2014 March 31). 

The strength of positive selection (i) for each amino acid site was 
calculated as previously described [29], by defining the Weighted 
Mean Value (co,) as: 

n 

cot = y^jkiWki 
k=i 

where C0jy represents the k-th a> value for the i-th site, and the 
weight fid is the associated posterior probability, according to 
model M8 (10 values of ff> are chosen for the discrete beta 
distribution, and another value of CO is not constrained by the beta 
distribution and is allowed to be greater than 1). For each pair of 
data set, each one associated to a disease stage, we tested whether 
the strength of positive selection was significantly different (HI), as 
opposed to being equivalent (HO), by using a paired Wilcoxon rank 
sum test with a continuity correction applied to the normal 
approximation for the P values [44] . Only shared sites having a 
weighted mean a> ; value greater than 1 in the two data sets being 
compared were included in the test. 

Three-dimensional mapping and contact maps 

The core structure used for mapping the positive selected 
residues corresponds to the X-ray structure of CD4-bound JR-FL 
gpl20 (Protein Data Bank [PDB] ID: 2B4C) [45]. This structure 
was chosen because it provides the spatial representation of V3, 
and is therefore useful for the identification of the PS residues in 
V3 (besides those in C3 and C4) in the appropriate structural 
context. Moreover, the chosen structure derives from a strain that 
has the same co-receptor usage as the majority of the isolates 
hereby characterized. Three-dimensional images were generated 
using the pyMOL v. 1.3 software (PyMOL website. Available: 
http://www.pymol.org/ Accessed 2014 March 31). 

For generation of the contact maps, we referred to the same 
gp 1 20 structure used for 3D mapping [45] . The contact map is a 
distance matrix that identifies, in a known 3D structure (e.g., taken 
from Protein Data Bank), the proximal contact between two areas. 
The contact distances between C-alpha carbons were used to 
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identify interactions between regions or differences between the 
clades. A 10 A cut-off distance between the C-alpha carbons was 
considered as a measure for contact [27]. Contact maps were 
generated with CMView vO.9.6, after loading the appropriate 
molecular structure by specifying its PDB identification number. 
The software allows to highlight relevant regions (e.g. C3 or C4 of 
Env) or single residues of the loaded structure. 

Statistical analysis 

The Wilcoxon rank sum test was used to evaluate the statistical 
significance of the difference between parameters analysed in the 
study (if not explicitly stated otherwise). 

Furthermore, to measure existing correlation between sequence 
length and glycosylation status, the Spearman's correlation test 
was used. All analyses were conducted using the Stata 8.2 software. 

Results 

Phylogenetic analysis 

We amplified and sequenced the V3-V5-coding regions of HIV 
variants isolated from 72 HIV-infected individuals living in South 
Africa and Swaziland, naive for ART during the years 2005 to 
2007. A phylogenetic analysis based on the V3-V5 sequence 
(Fig. 1) revealed that all patients were infected by HIV-1 subtype C 
variants, as also previously described [9]. In the figure virus 
sequences from patients have been highlighted with different 
colors according to year of collection of sample: 2005 in blue, 2006 
in red and 2007 in green. Since we wanted to study if and how 
sequence molecular characteristics changed during the evolution 
of the disease, we had to verify that the natural Env sequence 
evolution through the three year period would not have influenced 
the molecular analyses we performed. As it is evident in figure 1, 
the different years are well intermixed in the figure, indicating that 
the sampling year was not a main causal factor for the observed 
differences in the sequence characteristics of the Env regions we 
have considered for our studies. Furthermore, the phylogenetic 
analysis in figure 1 shows an intermixing of all the sequences, 
independendy of the patient region of origin, indicating that a 
possible, if any, difference in the ethnical background of the two 
populations from Swaziland and South Africa does not constitute a 
bias for the further analysis we performed. 

Amino acid sequence length and PNGS distribution 

Lengths of the amino acid sequences of the gpl20 V3, C3 and 
C4 regions from the HIV-1 infecting virus variant obtained from 
each one of the 72 patients, as well as the predicted tropism of the 
variant, are reported in Table 1. As shown in the Table, the 
median length of amino acid sequences of the V3, C3 and C4 
regions did not significandy vary with the disease stage. The V3 
sequence length was equal to 35 amino acids in virus variants from 
66 out of 72 samples, 34 amino acids in 5 variants, and 37 amino 
acids in 1 variant. In the variants with a V3 sequence of either 34 
or 37 amino acids, no correlation with disease stage was found. In 
fact, of the five variants with a V3 sequence of 34 amino acids, 2 
were from ES, 2 from CS and 1 from LS patients. The only 
variant with a V3 sequence of 37 amino acids was obtained from 
an LS patient. Regarding the predicted viral tropism, the three 
sample groups presented a similar pattern with 20 out of 24 
variants being R5 tropic and 4 being X4 tropic in the ES group; 
19 R5 and 5 X4 tropic in the CS group and 18 R5 tropic and 6 
X4 tropic in the LS group. 

Amino acid sequence length of the C3 region ranged from 50 to 
54 amino acids. No significant differences in sequence length 
among the disease stages were observed (Table 1), although a 



tendency to a broader length variability was detected in the ES 
group where 15 variants had a C3 region of 52 amino acids, 4 of 
53, 2 of 51, 2 of 50 and 1 of 54. In the CS and LS groups, the C3 
region was more conserved with 21 variants bearing a C3 region 
of 52 amino acids in both disease stages. The remaining 3 variants 
in the CS group had a C3 region of 5 1 amino acids, whereas in the 
LS group, 2 had a C3 region of 53 amino acids and 1 of 51 amino 
acids. The C4 region length was of 41 amino acids in all 72 
variants. 

Table 2 reports the mean and median number of PNGSs for 
each region and for each disease stage. Overall, both the median 
and the mean number of PNGSs of each region did not show 
statistically significant variations among the three disease stages. A 
qualitative analysis of the distribution of PNGSs among the 
different groups of patients is reported in figure 2. A number of 
PNGSs, in particular, PNGS N301 (in V3), N332, N339 and N356 
(in C3), and N442 and N448 (in C4) are conserved in variants 
obtained from patients at all the disease stages with a frequency 
higher than 70%, with the exception of PNGS 339 (numbering 
according to HXB2 sequence) in the C3 region that is present in 
virus variants obtained from patients at the LS stage where it is 
represented with a frequency of about 60%. For this reason these 
PNGSs have been named "fixed". Besides these, other PNGSs 
were found with a lower frequency in all the regions. Most of these 
PNGSs are represented with a frequency that is often <20% and 
are not constantly present in virus variants obtained from patients 
at all disease stages. Therefore, they have been named "shifting" 
PNGSs, in agreement with previous reports [9,14,46,47]. The 
majority of the shifting PNGSs are present in the C3 region in 
variants obtained during the chronic stage of the disease, in 
particular in the oc2 helix (amino acids 335 to 352) and in the N- 
terminal and in C-terminal portions of the C3 region. The PNGS 
at position 362 was found with a frequency of 33.3% in the 
variants from the CS group (encircled in the figure), whereas it is 
present at much lower frequency in variants from the ES and LS 
groups. A shifting PNGS between N442 and N448 in the C4 region 
was detected only during the late stage of the disease (encircled in 
the figure). Of note, we also found a low frequency of expression, 
at all disease stages, of the glycosylated residue 295, which has 
been described to be highly conserved in other HIV-1 subtypes 
[47] that comprehends the last amino acid in the C2 region and 
the first two in the V3 region. Since this PNGS is located between 
the C2 and the V3 regions, it has not been reported in figure 2. 

Analysis of positive selection 

For each disease stage, we evaluated which amino acid residue 
in the constant V3, C3 and C4 regions was under positive selection 
pressure, by calculating the dN/ dS ratio for each position. Table 3 
lists the sites that have a mean dN/ dS ratio > 1 in variants from at 
least one of the disease stages. The number of amino acid positions 
under positive selection pressure is considerably higher in the C3 
region at the chronic disease stage. In fact, in this region, ten, 
twelve and three residues appear to be under positive selection in 
the virus variants from ES, CS and LS groups, respectively. Nine 
out of twelve residues under positive selection pressure in the 
variants from CS group cluster in the N-terminal portion of the C3 
domain, within the a2 helix (amino acids 335 to 352). Within the 
V3 domain, only three sites are under positive selection pressure 
(amino acids 300, 309 and 322) in variants from the CS group, two 
in variants from the ES group (amino acids 300 and 322) and two 
in variants from the LS group (amino acids 300 and 309). In the 
C4 domain only one site is under positive selection pressure 
(amino acid 429) in virus variants obtained at all disease stages. In 
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Figure 1 . V3-V5 phylogenetic tree of HIV-1 strains from 72 HIV-1 -infected patients in South Africa and Swaziland. Tree was rooted by 
using the midpoint rooting method. Scale bar at the bottom indicates 7% nucleotide substitutions per site. The asterisk along a branch represents 
the bootstrap value (significant statistical support) >70% and p<0.001 in the zero-branch-length test. Years of collection of samples are color- 
highlighted: 2005 in blue, 2006 in red and 2007 in green. Country where sample collection was performed: SW = Swaziland; SA = South Africa. 
doi:1 0.1 371 /journal.pone.00951 83.g001 



this region, an additional amino acid at position 446 appears to be 
under positive selection pressure only in the LS group. 

We have then calculated the strength of the positive selection 
pressure for each one of the identified residues. No statistically 
significant difference between the set of weighted mean values (ft),) 
among the disease groups was found for both the V3 and C4 
regions (data not shown). Conversely, for the C3 domain, a 
significant difference (p = 0.00016) was found between the set of 
weighted ft), of the ES and CS groups (Fig. 3), with sequences from 
variants of the ES group displaying residues with higher ft), values 
than the ones calculated for the CS group. 

Molecular visualization of sites under positive selection 
pressure 

In order to characterize the positions of the residues under 
positive selection in the gpl20 protein, we mapped them on the 
three-dimensional structure of the CD4-bound gpl20 derived 
from the HIV- 1 JR-FL strain [45] . Figures 4, 5 and 6 show the 
positions of the residues under positive selection in the C3, C4 and 
V3 regions (spheres in purple), respectively. Sites in C3 from 



variants obtained from the ES and CS groups are more numerous 
than those found in the variants from the LS group (Fig. 4). These 
sites are located in the 0t2 helix region (in blue), near the CD4- 
binding site and close to residues critical for the binding to CCR5 
(spheres in red). Residues within the C3 region that are critical for 
the binding to CD4 (amino acids 366 and 368, not shown in the 
figure, since hidden by the CD4 structure) are fully conserved in 
virus variants obtained at all the three disease stages, as also 
confirmed by the low Shannon Entropy values (data not shown). 
In contrast, the number of sites under positive selection pressure in 
the C4 region (Fig. 5) is small in the virus variants from all the 
three disease groups. Again, these residues are placed near the 
CD4-binding site or close to residues critical for CCR5 binding. 
Finally, the few sites under positive selection in the V3 region are 
all proximal to the highly conserved residues critical for CCR5 
binding (Fig. 6). 

Electric charge analysis and interaction between regions 

In order to investigate whether the electric charges of C3 and 
C4 regions of gpl20 (that are the principal components of the 
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Table 2. Number of PNGSs in gp120 V3, C3 and C4 regions of HIV-1 clade C variants from patients at different disease stages. 



Number of PNGSs 



Disease stage 




C3 


C4 


V3 


EC 

E3 




2 75 


1 71 


1 00 




Median 


3.00 


2.00 


1.00 




Range 


2-4 


0-2 


0-2 


cs 


Mean 


2.96 


1.71 


0.96 




Median 


3.00 


2.00 


1.00 




Range 


1-4 


1-2 


0-1 


LS 


Mean 


2.67 


1.83 


0.96 




Median 


3.00 


2.00 


1.00 




Range 


2-4 


1-2 


0-1 



ES = Early disease Stage; CS = Chronic disease Stage; LS = Late disease Stage. 

doi:l 0.1 371 /journal.pone.00951 83.t002 



CD4 binding site) can have an impact on the conformation of the 
CD4 binding site, we calculated the electric charge of each one of 
these regions and correlated it with the electric charges of all the 
other variable regions. After computing all the total electric charge 
(Qtot) correlations, we found statistically significant correlations 
between C3 and V4 and between G3 and V5 regions. These 
correlations are shown in figures 7 and 8. A negative correlation 
was present in the C3-V4 interaction (increased Q, c)t in V4 region 
and decreased Qj ot in C3 region) at all stages of disease (Fig. 7). 
This correlation was statistically significant in the LS group of 
patients (panel c, p = 0.0090). The electric charge analysis of the 
C3-V5 interaction showed a strong tendency to a positive 
correlation (increased Qj ot in both C3 and V5) in the virus 
variants from LS group of patients (panel f, p = 0.0526). This 
correlation was statistically significant when the interaction 
between electric charges of 0t2 helix (in C3 region) and V5 was 
considered (Fig. 8 panel f, p = 0.0192). 

The correlations of the electric charges between C3 and V4, G3 
and V5, a2 helix and V4, and a2 helix and V5 were then further 
characterised by means of contact maps, which allowed us to 
identify close spatial contacts between two regions. Figure 9 shows 
the contact maps for C3-V4 and C3-V5 (Panel A), and oc2 helix- 
V4 and 0t2 helix- V5 (panel B) in the crystallographic structure of 
the JR-FL gpl20 bound to CD4 [45]. As shown in panel A (top), 
C3 and V4, and C3 and V5 regions strongly interact (light blue 
area). In particular, amino acids in C3 at positions 360, 362 and 
364, which have been found under positive selection pressure in 
the virus variants obtained during the chronic disease stage, 
interact with N-terminal residues of V4 region (panel A, left). In 
the interaction between C3 and V5 (panel A, right) a single residue 
at position 360 of the C3 region, under positive selection pressure 
in the variants from the chronic disease stage, was found to 
interact with residues in the C-terminus of the V5 region. 

In the panel B of figure 9, the interactions of oc2 helix with V4 
and V5 regions are reported. Residues 335, 336, 337, 343, 346, 
347 and 350 in a2 helix, which have been found under positive 
selection pressure in the chronic disease stage, interact with 
residues in the C-terminal and N-terminal of V4 (left). Finally, a 
single residue in the 0(2 helix, at position 349, interacts with one 
residue in the C-terminal of the V5 region (right). 



Discussion 

The role of gpl20 variable regions in HIV escape from immune 
response has been previously investigated (9-17). However, the 
role of constant regions of gp 1 20 has not been thoroughly studied 
to date. In the present work we focused on those gpl20 constant 
regions of HIV-1 clade C which are key in the binding to CD4 
receptor and coreceptors and are involved in the processes of 
HIV-1 cell entry and in host immune response. Amino acid 
sequence lengths, PNGSs distribution, residues under positive 
selection pressure and electric charges within the V3, C3 and C4 
conserved domains of HIV-1 clade C variants obtained from 
individuals at different stages of disease, were investigated. The 
aim of the study was to evaluate if and how these sequence 
characteristics in these constant regions were modulated during 
the course of the disease. 

The study was performed on only patients naive for ART to 
avoid influence of antiretroviral therapy on variants selection. It 
therefore provides a clearer picture of mechanisms of HIV 
immune escaping in patients who are not treated with antiretro- 
viral drugs, a condition that, unfortunately, is still common in 
many poor developing countries where HIV is highly endemic. 

We observed that the V3 sequence length is well conserved in all 
the disease stages (35 amino acids, on average), as previously 
reported [Los Alamos website. Available: http://www.hiv. 
lanl.gov/content/sequence/HIV/COMPENDIUM/ 1999/7/ 
V3LoopEvolution.pdf. Accessed 2014 March 31, and 9, 23, 24], 
since this length possibly ensures the optimal conformation for the 
binding to CCR5 [48] . Virus tropism, predicted on the basis of the 
V3 sequence, does not switch from R5 to X4 at the later disease 
stage, as it has been instead described for the HIV-1 B subtype 
[25,48-51], although a greater, but not significant, number of X4 
strains have been observed at the late stage, when compared to 
early and chronic stages. 

The amino acid length of the C3 region ranged from 50 to 54 
residues. In line with previous studies, we confirmed that the C3 
domain in subtype C variants is relatively variable, in particular in 
its 0t2 helix portion, although sequence length variability among 
disease stages was not statistically significant. The cc2 helix sub- 
domain is more exposed to the solvent and to neutralizing 
antibodies in clade C than in clade B virus and thus exhibits a 
higher sequence entropy at the polar face [24] . Conversely, the C4 
region was constant in its amino acid sequence length (4 1 aa) at all 
the disease stages. This may be due to the fact that, although 
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N301 


N332 


N339 


N356 


N362 


N442 


N448 


ES 


0.958 


0.708 


0.750 


0.875 


0.083 


0.833 


0.792 


CS 


0.958 


0.708 


0.708 


0.875 


0.333 


0.750 


0.833 


LS 


0.958 


0.708 


0.917 


0.917 


0.125 


0.833 


0.833 



Figure 2. Distribution of PNGS in V3, C3 and C4 regions of gp120 in HIV-1 clade C variants obtained during different disease 
stages. Table at the bottom reports the frequency of presence of each PNGS. PNGS position in the alignment is according to the HXB2 amino acid 
sequence numbering. Circles highlight a significant increase of shifting PNGS frequency ES = Early disease Stage CS = Chronic disease Stage LS = 
Late disease Stage. 

doi:1 0.1 371 /journal.pone.00951 83.g002 
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Table 3. Amino acid sites 
different disease stages. 


under positive selection in 


gp!20 V3, C3 and C4 regions 


of HIV-1 clade C 


variants from patients at 




Mean dN/dS ratio 


Region 


Position in HXB2 


Position in JR-FL 


ES 


cs 


LS 


V3 


300 


297 


3 nc;A + n ^qq 


j.uDo_u.oyu 


^.0/0+ 1 .ZZj 




5no 


5nfi 

5UO 


*C1 


3 ni Q+n 71 r 
j.u i y _u./ i o 


4.D/0+ I ,ZZD 






3 1 7 
5 I / 


j.UjU _ U./U/ 


j.UDU_U.OyO 


"CI 




55 1 ; 


JJ 1 


i i _u.yyj 


j.jU/ _U. I ZZ 


< 1 




336 


332 


a on? + n ^qr 


3 ^n7+n 1 1 r 

J.JU/ _U. 1 1 o 


7 RfiR + 1 

/.OUO— 1 .JJU 




557 


555 


.oO'f _ U. / U J 


3 ^n^+n 1 ^r 

j.DUD _U. I jo 


< 1 




5A5 


55Q 


A QC\~) + fl ^QQ 

*+.yuz _ u.Dyy 


j.jU/ _U. I ZU 


< 1 




344 


340 


4.901 ±0.600 


3.507±0.118 


<1 




346 


342 


4.902 ±0.599 


3.507±0.118 


<1 




347 


343 


<1 


3.505±0.143 


<1 




350 


346 


4.885 ±0.648 


3.507±0.118 


<1 




352 


348 


<1 


3.494 ±0.220 


<1 




360 


355 


4.902±0.598 


3.504±0.145 


<1 




362 


357 


4.902±0.598 


3.507±0.118 


7.807± 1.552 




364 


359 


4.902±0.598 


3.507±0.120 


7.808± 1.550 


C4 


429 


420 


2.935±0.911 


5.321 ±1.659 


2.905±0.591 




446 


437 


<1 


<1 


2.893 ±0.604 


ES = Early disease Stage; CS 


= Chronic disease Stage; LS = 


Late disease Stage. 







doi:10.1371/joumal.pone.0095183.t003 




335 336 337 343 



344 346 350 

aa position 



360 362 364 



Figure 3. Strength of positive selection of sites in the C3 region during the early and chronic disease stages. The strength of positive 
selection (co) for each amino acid site is reported in the y axis. Sites are reported in the x axis. Amino acids numbering is according to HXB2 sequence. 
Grey and black bars show the strength of positive selection of each of the sites for the early infection stage (ES) and the chronic infection stage (CS), 
respectively. Statistical significance of the difference between the two series: p = 0.0001 6. 
doi:1 0.1 371 /journal.pone.00951 83.g003 
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Figure 4. Sites under positive selection pressure in C3 region at 
different disease stages. HIV-1 gp120 protein model is from JR-FL 
strain [45]. Sites under selection pressure are indicated as spheres in 
purple. Labeling of each positively selected site is centered on the oc- 
carbon of the amino acid. CD4 receptor is in green. Residues involved in 
CCR5 binding are indicated as spheres in red. The a2 helix region is 
indicated in blue. Amino acid numbering is according to HXB2 
sequence. ES = Early disease Stage CS = Chronic disease Stage LS 
= Late disease Stage. 
doi:1 0.1 371 /journal.pone.00951 83.g004 

located together with the C3 region in the outer domain of gpl20 
protein, C4 is buried into a hydrophobic pocket and therefore is 
less exposed to the selection pressure of the host immune response. 

The number of conserved PNGSs in the V3, C3 and C4 regions 
was constant at all the disease stages. All conserved PNGSs in 
these regions are found in all M group viruses with a frequency > 
70% [52]. A single fixed PNGS in V3, at position 301, involved in 
coreceptor binding, was observed in the N-terminal portion [53]. 
No shifting PNGSs in V3 were observed at all disease stages, in 
agreement with previous data indicating that in the V3 domain the 
presence of shifting PNGSs is very rare, since appearance of new 
PNGSs may interfere with co-receptor binding [10]. In addition, 
we found a low frequency of expression of the glycosylated residue 
295, located at the confluence of C2 and V3, within the target 
epitope of the 2G12 monoclonal antibody. Absence of the 295 
residue in the clade C variants explains the resistance of subtype C 
strains to the neutralising effects of this monoclonal antibody [54]. 
In the C3 domain, shifting PNGS were mainly observed within the 
ac2 helix and the C-terminal portions, downstream of the 
conserved PNGS N356. These amino acid positions are important 
targets of neutralizing antibodies and the generation of new 
PNGSs could provide a shield against the immune system 
response. Interestingly, we have identified a PNGS at position 
362 with a frequency of 33.3% in the CS group and with a much 
lower frequency (8.3%) in the ES and LS groups. As observed by 
Sterjovski et aL, this residue, adjacent to the CD4-binding site, is 
associated with an enhanced virus fusogenic capacity and entry 
[55]. Since an increase in length of the amino acid sequence of VI 
and V4 regions during the chronic stage of the disease may result 
in a reduced affinity to the CD4 receptor [9,56], we can speculate 




Figure 5. Sites under positive selection pressure in C4 region at 
different disease stages. HIV-1 gp120 protein model is from JR-FL 
strain [45]. Sites under selection pressure are indicated as spheres in 
purple. Labeling of each positively selected site is centered on the tx- 
carbon of the amino acid. CD4 receptor is in green. Residues involved in 
CCR5 binding are indicated as spheres in red. The oc2 helix region is 
indicated in blue. Amino acid numbering is according to HXB2 
sequence. ES = Early disease Stage CS = Chronic disease Stage LS 
= Late disease Stage. 
doi:10.1371/journal.pone.0095183.g005 

that the PNGS 362 in C3 may be selected during the chronic stage 
to compensate the reduced affinity to CD4. 

In the C4 region, two conserved PNGSs at positions 442 and 
448 and a few shifting PNGSs between these two residues, have 
been identified. The N442 residue was suggested to be involved in 
glycan shielding [57,58], whereas the N448 residue is known to be 
critical for the binding to CCR3 co-receptor, together with residue 
N356 in C3 [59]. The N448 residue has also been reported to 
protect the V3 loop from neutralizing antibodies and to promote 
processing and/or presentation of T-helper epitopes for antiviral 
T-cell response [60,61]. Since PNGSs in the C4 region are 
positioned close to the pocket of the CD4-binding site, the shifting 
PNGSs lying in that region may shield the virus from both 
neutralizing antibodies and the T-helper mediated immune 
response [14]. 

In an attempt to characterize other sites that could be involved 
in virus escaping, we identified those sites in V3, C3 and C4 that 
are under positive selection pressure at the different disease stages 
and mapped them onto the three-dimensional structure of the 
CD4-bound gpl20 from the JR-FL HIV-1 virus [45]. Most of 
these sites under positive selection pressure were found within the 
C3 region and, in particular, in the 0t2 helix portion during early 
and chronic disease stages. We also detected three sites in the V3 
region under positive selection pressure during the chronic disease 
stage, and two sites in both the early and late disease stages. The 
presence of sites under positive selection pressure in the oc2 helix 
region, as well as in the V3 region, indicates a strong diversifying 
selection of these regions, in a subtype-specific manner 
[19,29,30,62]. 
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Figure 6. Sites under positive selection pressure in V3 region at 
different disease stages. HIV-1 gp120 protein model is from JR-FL 
strain [45]. Sites under selection pressure are indicated as spheres in 
purple. Labeling of each positively selected site is centered on the ot- 
carbon of the amino acid. CD4 receptor is in green. Residues involved in 
CCR5 binding are indicated as spheres in red. The oc2 helix region is 
indicated in blue. Amino acid residue 300 in V3 is partially hidden by 
residue 441 of the C4 region, to which it is spatially close. Amino acid 
numbering is according to HXB2 sequence. ES = Early disease Stage CS 
= Chronic disease Stage LS = Late disease Stage. 
doi:10.1371/journal.pone.0095183.g006 

The presence of sites under positive selection pressure in C3 
region of HIV- 1 clade C subtype during the chronic disease stage 
has been described in literature [63]. However, we also found a 
strong reduction of sites under positive selection pressure in the C3 
region during the late disease stage. This is the first time, to our 
knowledge, that such an evidence is reported. Lastiy, in the C4 
region close to the CD4 binding site, two sites were found under 
positive selection pressure during the late stage of the disease and 
only one in the early and chronic disease stages. Of note, some of 
the sites under positive selection pressure were also found close to 
critical residues for binding to either CD4 or CCR5. 

The strength of positive selection was found to be significantly 
different among the disease stages only in the C3 region. In 
particular, it was found to be significantly higher at the early stage 
of the disease, when compared to the one at the chronic stage. 
Taken together these data lead us to speculate that during the 
early disease stage, when the immune response may be less 
developed compared to the chronic stage, more degrees of 
freedom for the three-dimensional structure of the viral particle 
surface are allowed. Conversely, during the chronic disease stage, 
the greater pressure generated by a more developed immune 
response may lead to the selection of more non-synonymous 
substitutions that are compatible with an increasingly constrained 
three-dimensional conformation of the gpl20 molecule in terms of 
viral fitness, since fewer degrees of freedom are allowed. The 
greater number of positive selected sites in the C3 region during 
the chronic disease stage, in comparison to those observed during 
the early stage, may be seen as a compensatory effect by which the 



virus counteracts the lower strength of positive selection in the CS 
stage with an increased number of positive selected sites in this 
stage. This effect may, in turn, benefit a higher resistance of the 
virus to the immune response, in particular to the action of 
neutralising antibodies. 

Since no significant differences in the number and the strength 
of positive selection were observed in both V3 and C4 regions at 
any disease stage, we propose that the C3 region could be a more 
critical target for positive selection than C4 and V3, as already 
suggested by Moore et al., using HIV-1 clade C chimera viruses 
[6]. In this regard, data in literature indicate that the 0t2 helix in 
HIV-1 subtype C is more exposed to the solvent than in HIV-1 
subtype B and that the residues with dN/dS >1 are associated 
with resistance to neutralization [24]. Electrical charge analysis 
revealed a significant strong negative correlation between the total 
electric charges of C3 and V4 in virus variants during the late 
stage of disease coupled with a strong positive correlation between 
the total electric charges of C3 (in particular the 0t2 helix) and V5 
during the late stage of the disease. Contact maps analysis showed 
that the N-terminus of the 0t2 helix and the C-terminus of the V4 
region are in close proximity. This may be related to the charge 
negative correlation between C3 and V4 (and oc2 helix and V4), 
seen in the variants at late disease stage, in agreement with 
previously published data [24]. Similarly, C3, in particular the 0t2 
helix, strongly interacts with V5, with a positive electric charge 
interaction. Overall, data on electric charge and interaction 
among regions suggest that during the course of the disease 
changes in electric charge and strong physical interactions can 
affect the conformation of the regions of gpl20. These changes 
may be instrumental in the virus escape mechanism. In fact, the 
negative value of the charge cross-correlation between C3 and V4 
in the late disease stage may indicate that an attractive force is 
mediated by such changes, thus maintaining a more compact and 
closed structure of the two regions. Contact maps indicate that C3 
and V4 are in close proximity and show a spatial preference for 
charged residues, suggesting that the negative correlation between 
C3 and V4 can have a physical explanation in terms of 
electrostatic interactions between these two regions. It could be 
assumed that the negative C3-V4 correlation is the hallmark of an 
electrostatic attraction between these two regions. This, in turn, 
induces a conformational change in both regions towards a more 
closed conformation that may protect the C3-V4 epitope from 
neutralizing antibodies. The absence of a significant C3-V4 
negative correlation during the chronic disease stage could be 
compensated by the presence of the PNGS at position 362 in the 
C-terminal region of C3, whose frequency is increased during 
chronic disease stage. The PNGS at position 362 is near the CD4 
binding site, and strongly interacts with the V4 region, as shown 
by the contact maps. The presence of the glycan residue at 
position 362 could, therefore, act as a shield against neutralizing 
antibodies. In late disease, the electric charge-mediated attraction 
between C3 and V4, which favors a more closed conformation, 
could be counter-balanced by the repulsive forces between C3 (in 
particular the oc2 helix) and V5, shown here. These repulsive forces 
could be responsible for a more open conformation of these two 
regions. Since the orientation of V5 regulates the opening of the 
CD4 binding pocket [55 and Cenci and D'Avenio, personal 
communication], these data support the hypothesis that, when the 
immune system becomes weaker with disease progression, a more 
open conformation in these two regions may increase the fitness 
for the binding to the CD4 cell receptor, as critical residues for 
CD4 binding have been shown to be present in C3, in addition to 
those present in C4 [30,31]. 
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Figure 7. Electric charge correlations between C3-V4 and C3-V5 regions. Statistically significant p values are in bold. Rho 
coefficient ES = Early disease Stage CS = Chronic disease Stage LS = Late disease Stage. 
doi:1 0.1 371 /journal.pone.00951 83.g007 



correlation 



It is also noteworthy that some sites under positive selection 
pressure are found to be involved in the interactions between C3 
and V4 (amino acids 360, 362 and 364 in C3) and between C3 
and V5 (amino acids 360 and 362 in C3), hinting at a possible role 
in conformational changes driven by the immune pressure. 
Similarly, the contact map between the a.2 helix and the V4 
region indicates that the sites under positive selection pressure 
within the ot2 helix (residues 335, 336, 337, 343, 346, 347 and 350) 
strongly interact with C terminal residues in V4 and with one 
residue at V4 N-terminal. It is plausible that the a2 helix structure, 
interacting with V4 through electric charge rearrangements, could 



relocate the neutralizing epitope in the C3-V4 region, thereby 
eluding the host's neutralizing antibodies. 

Conclusions 

We have identified, for the first time and in a large sample of 
virus variants, several distinguishing features in V3, C3 and C4 
regions of HIV clade C gp!20 protein. We have described the 
variability of these features during the course of the disease and 
speculated on possible implications in virus infectivity and 
resistance to the host's immune response. In particular, we 
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Figure 8. Electric charge correlations between «2 helix-V4 and «2 helix-V5 regions. Statistically significant p values are in bold. Rho = 

correlation coefficient ES = Early disease Stage CS = Chronic disease Stage LS = Late disease Stage. 

doi:10.1371/journal.pone.0095183.g008 



described the presence of sites in C3, C4 and V3 regions that are 
under positive selection pressure and are proximal to residues 
important for the binding to CD4 receptor or co-receptors, and 
mapped them in critical areas of contact between these regions. 
These sites may be instrumental to charge-associated conforma- 
tional changes that may play a role in the evolution of HIV. 

These findings may contribute significandy to the search for 
new immunogens suitable for developing a sterilising vaccine 
against HIV, as well as to the design of novel antiviral drugs. 



Accession codes. Sequences included in this paper have been 
deposited in GenBank under the accession numbers JN 121046 to 
JN121117. 
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Figure 9. Contact maps between C3-V4, C3-V5, a2 helix-V4 and a2 helix-V5 of gp120. Contact maps are generated from the 
crystallographic structure of the CD4-bound JR-FL2 gp120 [45]. Regions are indicated in violet and in pink. The boxed area in light blue is the contact 
area. Solid lines split the region in the N-terminal and C-terminal portions. In the contact areas, only amino acids that have been found under positive 
selection are shown (indicated by arrows and circled). Amino acid numbering is according to HXB2 sequence. Panel A) left: C3 region is reported 
horizontally (in violet), V4 region vertically (in pink). Panel A) right: C3 region is reported horizontally (in violet), V5 region vertically (in pink). Panel B) 
left: a2 helix is reported horizontally (in violet), V4 region vertically (in pink). Panel B) right: a2 helix is reported horizontally (in violet), V5 region 
vertically (in pink). 
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